Abstract 

Next-generation sequencing technologies have been designed to discover rare and de novo variants and are an 
important tool for identifying rare disease variants. Many statistical methods have been developed to test, using 
next-generation sequencing data, for rare variants that are associated with a trait. However, many of these 
methods assume that rare variants within a gene are in linkage equilibrium. In this report, we studied 
whether transmitted or untransmitted haplotypes carry an excess of rare variants using the whole genome 
sequencing data of 15 large Mexican American pedigrees provided by the Genetic Analysis Workshop 18. We 
observed that an excess of rare variants are carried on either transmitted or nontransmitted haplotypes from 
parents to offspring. Further analyses suggest that such nonrandom associations among rare variants can be 
attributed to population admixture and single-nucleotide variant calling errors. Our results have significant 
implications for rare variant association studies, especially those conducted in admixed populations. 



Background 

Next-generation sequencing technologies have become a 
major tool for identifying disease-associated rare variants 
[1]. Many statistical methods have been developed to test 
for association between rare variants and complex traits 
using next-generation sequencing data [2-8]. Most statistical 
methods for rare variant association testing either do 
not address rare variant calling errors or implicitly assume 
that rare variants are correctly called. We studied the 
distribution of rare variants in transmitted and untransmitted 
haplotypes from parents to their offspring in nuclear 
families using whole genome sequencing data from 
15 large Mexican American pedigrees provided by the 
Genetic Analysis Workshop 18 (GAW18). We observed 
an excess of rare variants falling on either transmitted or 
nontransmitted haplotypes from parents to offspring, sug- 
gesting linkage disequilibrium (LD) among rare variants 
and/or single-nucleotide variant (SNV) calling errors. 

Methods 

Data preparation 

The GAW18 data include 464 Mexican American individuals 
from 16 large pedigrees with whole genome 
sequence data available. Our goal is to study the LD 
among rare variants using the sequencing data by comparing 
the number of rare variants on transmitted and 
untransmitted haplotypes. In our analysis, we excluded 
all single-nucleotide polymorphisms (SNPs) with a 
minor allele frequency (MAF) >0.01. We also excluded 
SNPs with (a) a missing genotype rate >5%; (b) 
Hardy-Weinberg equilibrium (HWE) test p values 
<0.001; or (c) observed Mendelian errors. 
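
The exclusion criteria above can be sketched as a simple per-variant filter. This is a minimal illustration, not the authors' pipeline: the `VariantQC` record, its field names, and the example variant identifiers are hypothetical, and the summary statistics are assumed to have been computed beforehand by a genotype QC tool.

```python
from dataclasses import dataclass


@dataclass
class VariantQC:
    """Per-variant summary statistics (hypothetical field names)."""
    maf: float             # minor allele frequency
    missing_rate: float    # fraction of individuals with a missing genotype
    hwe_p: float           # Hardy-Weinberg equilibrium test p value
    mendelian_errors: int  # number of observed Mendelian errors


def passes_qc(v, max_maf=0.01, max_missing=0.05, min_hwe_p=0.001):
    """Return True if a variant survives the four exclusion criteria."""
    return (
        v.maf <= max_maf                  # keep rare variants only (MAF <= 0.01)
        and v.missing_rate <= max_missing  # drop missing genotype rate > 5%
        and v.hwe_p >= min_hwe_p           # drop HWE p value < 0.001
        and v.mendelian_errors == 0        # drop any observed Mendelian error
    )


# Toy input; identifiers and values are made up for illustration.
variants = {
    "rs_a": VariantQC(maf=0.004, missing_rate=0.01, hwe_p=0.40, mendelian_errors=0),
    "rs_b": VariantQC(maf=0.120, missing_rate=0.00, hwe_p=0.90, mendelian_errors=0),
    "rs_c": VariantQC(maf=0.006, missing_rate=0.08, hwe_p=0.70, mendelian_errors=0),
    "rs_d": VariantQC(maf=0.003, missing_rate=0.00, hwe_p=0.50, mendelian_errors=2),
}
kept = [name for name, v in variants.items() if passes_qc(v)]
print(kept)
```

Only `rs_a` survives in this toy input: the other three fail the MAF, missingness, and Mendelian error criteria, respectively.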

GAW18 also provides hypertension data. We used the 
hypertension status provided by GAW18, which is based 
on blood pressure measurements at 4 study exams in the 
past 20 years. An individual is defined as hypertensive if 
the individual's systolic blood pressure (SBP) is >140 mm Hg, diastolic 
blood pressure (DBP) is >90 mm Hg, or the individual is on antihypertensive 
medications at any of the 4 exams, and as normotensive 
otherwise. If an individual has missing values for all 4 
exams, the individual's hypertension status is considered 
missing. 
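
The hypertension definition can be written as a short function. This is a sketch under stated assumptions: each exam is represented as a hypothetical `(sbp, dbp, on_medication)` tuple, with `None` standing for an exam with no usable data; the actual GAW18 file layout is not reproduced here.

```python
from typing import Optional, Sequence, Tuple

# One exam: (SBP in mm Hg, DBP in mm Hg, on antihypertensive medication),
# or None when the exam is missing. This representation is an assumption.
Exam = Optional[Tuple[float, float, bool]]


def hypertension_status(exams: Sequence[Exam]) -> Optional[bool]:
    """True = hypertensive, False = normotensive, None = status missing."""
    observed = [e for e in exams if e is not None]
    if not observed:
        # Missing values at every exam: status is missing.
        return None
    # Hypertensive if any exam shows SBP > 140, DBP > 90, or medication use.
    return any(sbp > 140 or dbp > 90 or on_meds for sbp, dbp, on_meds in observed)


print(hypertension_status([(120, 80, False), None, (150, 85, False), None]))
print(hypertension_status([None, None, None, None]))
```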

Analysis method 

We first selected family trios of "mother, father, and 
child" from each of the 16 large pedigrees. Among the 16 
pedigrees, 1 pedigree does not include any trio with 
sequencing data available. Thus, the family trios were all 
from 15 pedigrees. Similar to the traditional transmission 
disequilibrium test (TDT), we examine the transmission 
of a rare variant allele. Because our analysis focuses on 
those variants with a MAF<1%, we expect no more than 
1 parent to be heterozygous. If both parents are heterozygous 
for a variant, this variant is excluded from our analysis. 
Thus, assuming no recombination within a gene or region, all minor 
alleles transmitted from a parent to an offspring must fall on the 
same haplotype. We examine all the rare variants in a 
gene or a region simultaneously instead of examining 1 
variant at a time. Let m be the number of trios regardless 
of an offspring's disease status in the data and L be the 
number of rare variants in a gene or a genomic region. 
We denote $m_{12}$ to be the total number of transmitted 
minor alleles and $m_{21}$ to be the total number of nontransmitted 
minor alleles across the L variants in m trios. 
In this way, $m_{12}$ is equal to the total number of rare variants 
falling on transmitted haplotypes, and $m_{21}$ is the 
total number of rare variants on nontransmitted haplotypes. 
If rare variants are randomly distributed 
on haplotypes (or, equivalently, there is no LD among 
rare variants), we would expect $m_{12} = m_{21}$. We can use 
the TDT statistic 

$$T = \frac{(m_{12} - m_{21})^2}{m_{12} + m_{21}}$$

to test the randomness among the rare variants. Under the null 
hypothesis of no LD among rare variants, the statistic T follows 
a chi-square distribution with 1 degree of freedom (DF). 
When only the trios with affected offspring are included, 
this test will test the association between rare variants 
and disease status. 
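
The counting and the test statistic described above can be sketched as follows. This is an illustrative implementation, not the authors' code: genotypes are assumed to be coded as minor-allele counts (0, 1, or 2), the function names are hypothetical, and the 1-DF chi-square tail probability is obtained from the identity P(X > t) = erfc(sqrt(t/2)), which avoids any third-party dependency. The worked example uses the DLG2 counts reported in Table 1 (279 transmitted, 876 nontransmitted).

```python
import math


def count_transmissions(trio_genotypes):
    """Count transmitted (m12) and nontransmitted (m21) minor alleles.

    trio_genotypes: iterable of (father, mother, child) minor-allele counts,
    one tuple per variant per trio.
    """
    m12 = m21 = 0
    for father, mother, child in trio_genotypes:
        # Informative only when exactly one parent is heterozygous and the
        # other is homozygous for the major allele; variants with both
        # parents heterozygous are excluded.
        if sorted((father, mother)) != [0, 1]:
            continue
        if child == 1:
            m12 += 1   # minor allele transmitted
        elif child == 0:
            m21 += 1   # minor allele not transmitted
        # child == 2 would be a Mendelian error and is skipped here
    return m12, m21


def tdt_statistic(m12, m21):
    """Return (T, p) with T = (m12 - m21)^2 / (m12 + m21), 1 DF."""
    if m12 + m21 == 0:
        raise ValueError("no informative transmissions")
    t = (m12 - m21) ** 2 / (m12 + m21)
    p = math.erfc(math.sqrt(t / 2.0))  # chi-square(1 DF) upper tail
    return t, p


# Toy trio data: one transmission, one nontransmission, three skipped.
print(count_transmissions([(0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 0), (0, 1, 2)]))

# DLG2 counts from Table 1.
t, p = tdt_statistic(279, 876)
print(f"T = {t:.2f}, p = {p:.3g}")
```

For the DLG2 counts, T is about 308.6 and the p value is on the order of 10^-69, in line with the value reported for that gene in Table 1.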

Results 

We applied the proposed methods to the GAW18 
sequence data. After quality control, there are 2,749,275 
rare SNPs remaining for association analysis. We identified 
5 trios with affected offspring and 36 trios with unaffected 
offspring. Because the number of affected offspring trios is 
small, we only analyzed the unaffected offspring trios. 
Because some of the trios were selected from the same 
pedigrees, we analyzed 15 independent unaffected off- 
spring trios. This was done by randomly selecting 1 family 
trio if multiple family trios were available for a pedigree. 
We grouped SNPs into genes or regions according to the 
Ensembl annotation (http://www.ensembl.org). As a result, 
we had 38,091 genes and regions. The average number of 
SNPs in a gene or a region was 92. 
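
The random selection of 1 trio per pedigree, used above to obtain independent trios, can be sketched as below. The pedigree and trio identifiers are made up, and the fixed seed is an assumption added only to make the sketch reproducible; the source does not state how the random draw was performed.

```python
import random
from collections import defaultdict


def one_trio_per_pedigree(trios, seed=0):
    """Pick 1 trio at random from each pedigree.

    trios: iterable of (pedigree_id, trio_id) pairs (hypothetical IDs).
    Returns a dict mapping pedigree_id -> selected trio_id.
    """
    by_pedigree = defaultdict(list)
    for pedigree_id, trio_id in trios:
        by_pedigree[pedigree_id].append(trio_id)
    rng = random.Random(seed)  # seeded for reproducibility (an assumption)
    return {ped: rng.choice(members) for ped, members in sorted(by_pedigree.items())}


trios = [("ped2", "t1"), ("ped2", "t2"), ("ped3", "t3"), ("ped5", "t4"), ("ped5", "t5")]
selected = one_trio_per_pedigree(trios)
print(selected)
```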



Figure 1A presents the Q-Q plot of -log10(p value) for 
all the genes or regions across the genome. We 
observed a substantial inflation of the test statistic, suggesting 
that rare variants are not randomly distributed on transmitted 
and nontransmitted haplotypes. We examined 
where the significant genes were located in the genome 
using a Manhattan plot (Figure 1B). These 
significant genes are distributed evenly across the genome 
rather than clustered in a few regions, suggesting that 
our test is not detecting linkage. We then examined the 
20 most significant genes, which are presented in Table 1. 
Among these top 20 genes, 18 have more rare variants 
on nontransmitted than on transmitted haplotypes, 
and 2 show the opposite pattern. We hypothesized that the 
excess of rare variants on nontransmitted haplotypes is 
probably caused by SNV calling errors. The reason is that 
when a rare variant is observed in an offspring, it should 
also be observed in 1 of the offspring's parents; otherwise, 
the Mendelian error check will filter out this variant. 
However, the Mendelian error check will not filter out 
SNV calling errors on nontransmitted haplotypes 
unless all grandparents are in the pedigrees and their sequencing 
data are available. Thus, an excess of rare variants on 
nontransmitted haplotypes may be expected. However, this 
does not explain the excess of rare variants on transmitted 
haplotypes for the 2 genes CD247 and KIF1B. Consequently, 
we examined whether the excess of rare variants 
was caused by specific families. This was indeed the case: 
these significant p values could be attributed 
to a small number of transmitted or nontransmitted 
haplotypes for each gene (Table 1). We then asked 
why these haplotypes carry significantly more rare 
variants. Specifically, we examined the 2 genes, CD247 and 
KIF1B, that have an excess of rare variants on transmitted 
compared to nontransmitted haplotypes. Because 
the transmitted rare variants were present in offspring and 
1 of 2 parents, these variants are less likely to be mistakenly 
called. For the CD247 gene, we identified 1 transmitted 
haplotype carrying 175 rare variants. We searched the 1000 
Genomes Project database (http://browser.1000genomes.org/) 
and identified 170 of the 175 variants as present in 
the database. Among these 170 variants, 
154 are present only in African samples; the 
other 16 are present in Africans and in other ethnic 
populations in the 1000 Genomes Project database. 
Similar results were observed for the KIF1B gene. Thus, 
our results suggest that the excess of rare variants on these haplotypes, 
or the LD among the rare variants, is caused by 
population admixture with African ancestry populations. 

Discussion 

It has been suggested that rare variants are likely independent 
in general [6]. However, our analysis suggests 

Figure 1 A, Q-Q plot of -log10(p value) based on 15 randomly selected independent nuclear families; B, Manhattan plot of the genes 
or regions with -log10(p value)>3 based on the 15 nuclear families. The horizontal line represents -log10(p value)=3. 



Table 1 P values of the top 20 genes based on 15 nuclear families 

| Chr | Gene | # of rare variants transmitted | # of rare variants nontransmitted | p Value | # of trios contributing most statistical evidence | # of rare variants transmitted in the most contributing trios | # of rare variants nontransmitted in the most contributing trios |
|---|---|---|---|---|---|---|---|
| 11 | DLG2 | 279 | 876 | 4.45 × 10^-69 | 4 | 44 | 578 |
| 5 | CDH18 | 117 | 473 | 1.23 × 10^-48 | 2 | 12 | 272 |
| 11 | RP11-179A16.1.1 | | | 1.11 × 10^(?) | 4 | 30 | |
| 11 | | | | 1.33 × 10^(?) | 4 | 16 | |
| 7 | | 200 | | 1.47 × 10^-40 | 2 | 21 | |
| 13 | PCDH9 | 73 | | | 1 | 13 | 217 |
| 3 | STAG1 | 31 | 233 | 1.75 × 10^-35 | 3 | 3 | 207 |
| 11 | NELL1 | 114 | 393 | 2.93 × 10^-35 | 3 | 26 | 232 |
| 1 | CD247 | | 12 | 6.58 × 10^(?) | 1 | 175 | 2 |
| 1 | OSBPL9 | 19 | 197 | 9.20 × 10^-34 | 1 | 0 | 166 |
| 11 | GRIA4 | 28 | 212 | 1.56 × 10^-32 | 3 | 6 | 156 |
| 3 | EPHA6 | 151 | 430 | 5.53 × 10^-31 | 6 | 95 | 366 |
| 5 | CDH12 | 95 | 332 | 1.88 × 10^-30 | 2 | 15 | 190 |
| 1 | KIF1B | 199 | 28 | 7.44 × 10^-30 | 1 | 180 | 1 |
| 11 | RP11-124G5.3.1 | 16 | 168 | 3.83 × 10^-29 | 1 | 1 | 146 |
| 11 | UVRAG | 35 | | 5.09 × 10^(?) | 4 | 5 | |
| 1 | | 60 | 253 | 1.04 × 10^-27 | 3 | 27 | |
| 11 | | 40 | | 1.49 × 10^(?) | 2 | 10 | 161 |
| 5 | RP11-454P21.1.1 | 24 | | 1.57 × 10^-26 | 3 | 4 | 158 |
| 15 | RP11-387D10.2.1 | 12 | 145 | 2.55 × 10^-26 | 1 | 0 | 132 |

that substantial LD among rare variants could be introduced 
by population admixture. Wang and Zhu [9] suggested 
that there are substantial genotype calling errors, 
especially for rare and de novo variants, in whole genome 
sequencing data. However, genotype calling errors alone cannot 
explain the excess of rare variants carried by a few 
haplotypes in these data. When association tests for rare 
variants are conducted in admixed populations such as 
African Americans and Mexican Americans, the LD 
among rare variants created by population admixture 
can generate false-positive findings. Our results also suggest 
that even the TDT may not overcome this problem 
if multiple rare variants are analyzed together. 

Conclusions 

In summary, our analysis indicates that substantial LD 
among rare variants can be created by population 
admixture and by genotype calling errors. Novel 
statistical approaches for rare variant association analysis 
are required to account for the LD among rare variants 
that arises from either population admixture or genotype 
calling errors. Family data have been suggested 
as having many statistical advantages in detecting rare 
disease variants [4,10] and may help address these 
problems. 